THE BELL SYSTEM 

TECHNICAL JOURNAL 

DEVOTED TO THE SCIENTIFIC AND ENGINEERING 
ASPECTS OF ELECTRICAL COMMUNICATION 

Volume 49 February 1970 Number 2 

Copyright © 1970, American Telephone and Telegraph Company 

On the Interaction of Roundoff Noise 
and Dynamic Range in Digital Filters* 

By LELAND B. JACKSON 

(Manuscript received October 22, 1969) 

The interaction between the roundoff-noise output from a digital filter 
and the associated dynamic-range limitations is investigated for the case 
of uncorrelated rounding errors from sample to sample and from one error 
source to another. The required dynamic-range constraints are derived in 
terms of $L_p$ norms of the input-signal spectrum and the transfer responses 
to selected nodes within the filter. The concept of "transpose configurations" 
is introduced and is found to be quite useful in digital-filter synthesis; for 
although such configurations have identical transfer functions, their round- 
off-noise outputs and dynamic-range limitations can be quite different, 
in general. Two transpose configurations for the direct form of a digital 
filter are used to illustrate these results. 

I. INTRODUCTION 

With the rapid development of digital integrated circuits in the 1960's 
and the potential for large-scale integration (LSI) of these circuits in 
the 1970's, digital signal processing has become much more than a tool 
for the simulation of analog systems or a technique for the implementation
of very complex and costly one-of-a-kind systems alone. The
traditional advantages of digital systems, such as high accuracy, stable
parameter values, and straightforward realization, have been supplemented
through the use of integrated circuits by the additional advantages
of high reliability, small circuit size, and ever-decreasing cost.
As a result, it now appears that many signal processing systems which 
have been in the exclusive domain of analog circuits may in the future 
be implemented using digital circuits; while other proposed systems 
which could not be implemented at all because of the practical limita- 
tions of analog circuits may now be realized with digital circuits. 2 

The key element in most of these new signal-processing systems is 
the digital filter. The term "digital filter" here denotes a time-invariant, 
discrete or sampled-data filter with finite accuracy in the representation 
of all data and parameter values. 3-5 That is, all data and parameters 
within the filter are "quantized" to a finite set of allowable values with, 
in general, some form of error being incurred as a result of the quantiza- 
tion process. Implicit in this quantization is a maximum value or set 
of maximum values for the magnitudes of these data and parameters 
which, in the case of the data, is usually referred to as the "dynamic 
range" of the filter. 

Without the above quantization effects, linear discrete filters could 
be implemented exactly. Of course, one very significant feature of 
digital signal processing is that arbitrarily high accuracy can, in fact, 
be maintained once the initial analog-to-digital (A-D) conversion (if 
any) has taken place. However, there are still practical limitations to 
the accuracy of any physical system, and often it is desirable to mini- 
mize the accuracy of the implementation (while still satisfying the 
system specifications) in order to minimize the cost of the system. Hence, 
a thorough understanding of quantization errors in digital filters is 
quite important if the full potential of digital signal processing is ever 
to be realized. 

II. QUANTIZATION ERRORS IN DIGITAL FILTERS 

The specific sources of quantization error in the implementation and 
operation of a digital filter are as follows: 

(i) The filter coefficients (multiplying constants) must be quantized 
to some finite number of digits (usually binary digits, or bits). 

(ii) The input samples to the filter must also be quantized to a
finite number of digits.

(iii) The products of the multiplications (of data by coefficients)
within the filter must usually be rounded or truncated to a smaller
number of digits.

(iv) When floating-point arithmetic is used, rounding or truncation 
must usually be performed before or after additions as well. 

The first source of error above is deterministic and straightforward 
to analyze in that the filter characteristics must simply be recomputed 
to reflect the (small) changes in the filter coefficients due to quantizing. 6,7 
However, the inclusion of coefficient quantization in the initial filter 
synthesis procedure in order to minimize (in some sense) the resulting 
filter complexity produces a complex problem in nonlinear integer 
programming which has only begun to be investigated. 
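
The recomputation described above for the first error source can be sketched numerically. The fragment below is only an illustration under assumed values: a second-order section with denominator 1 + a1 z^-1 + a2 z^-2, a hypothetical pole radius and angle, and a hypothetical helper `quantize` that rounds to a chosen number of fractional bits. None of these names or numbers come from the paper.

```python
import cmath
import math

def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (round half up)."""
    step = 2.0 ** -frac_bits
    return math.floor(value / step + 0.5) * step

def poles(a1, a2):
    """Roots of z**2 + a1*z + a2, the poles of 1 / (1 + a1 z^-1 + a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0

# Assumed design values (illustrative only).
r, theta = 0.95, math.pi / 8
a1, a2 = -2.0 * r * math.cos(theta), r * r

frac_bits = 6                      # assumed coefficient word length
a1_q, a2_q = quantize(a1, frac_bits), quantize(a2, frac_bits)

# The effect of coefficient quantization is deterministic: simply
# recompute the filter characteristics from the quantized values.
p_ideal = poles(a1, a2)[0]
p_quant = poles(a1_q, a2_q)[0]
print("ideal pole radius/angle:    ", abs(p_ideal), cmath.phase(p_ideal))
print("quantized pole radius/angle:", abs(p_quant), cmath.phase(p_quant))
```

Comparing the two printed pole positions shows the shift in the filter characteristics caused by the chosen coefficient word length.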

The second source of error is often referred to as "quantization noise". 
It is inherent in any A-D conversion process and has been studied in 
great depth. 8 Hence, input quantization has not been included in our 
investigation, except as it relates to other error sources of interest. 

The third and fourth error sources are similar to the second since they 
also involve quantization of the data, but they differ in two respects: 
(i) The data to be quantized is already digital in form, and (ii) the round- 
ing or truncation of the data takes place at various points within the 
filter, not just at its input. To distinguish these sources of error from the 
input quantization noise, the resulting error processes will be referred 
to as "roundoff noise" (to be used generically, whether rounding or 
truncation is actually employed). Because of (ii), the roundoff noise is 
potentially much larger than the input quantization noise, and it is one 
of the principal factors which determine the complexity of the digital 
filter implementation, especially when special-purpose hardware is used. 

There are three variables in the filter implementation which deter- 
mine the level and character of the roundoff noise for a given input signal: 

(i) the number of digits (bits) used to represent the data within 
the filter, 

(ii) the "mode" of arithmetic employed (that is, fixed-point or 
floating-point), and 

(iii) the circuit configuration of the digital filter.

The number of 
digits in the data may be thought of as determining either the quantiza- 
tion step size or the dynamic range of the filter. We choose here the 
latter interpretation in order to have the same step size for all filters. 
Therefore, with this interpretation, the number of data digits does not 
affect the level of the roundoff noise directly, but rather it limits the 
maximum allowable signal level and hence the realizable signal-to-noise 
ratio. Data within the filter must, of course, be properly "scaled" if the 
maximum signal-to-noise ratio is to be maintained without exceeding the 
dynamic-range limitations. Among the principal results reported here 
are the determination of appropriate scaling for certain important classes 
of input signals and the calculation of the effect of this scaling on the 
output roundoff noise. 

The output roundoff noise from a floating-point digital filter is usually 
(but not always) less than that from a fixed-point filter with the same 
total number of data digits because of the automatic scaling provided 
by floating-point arithmetic. 9,10 However, since floating-point arithmetic 
is significantly more complex and costly to implement, most special- 
purpose digital filters have been, and will probably continue to be, con- 
structed with fixed-point hardware. Hence, we have considered only 
fixed-point digital filters in this work although much of the analysis 
could be adapted to floating-point filters. Oppenheim has recently 
proposed another interesting mode of arithmetic for digital filter im- 
plementation, called "block-floating-point", which provides a simplified 
form of automatic scaling of the filter data. 11 As would be expected, the 
performance of block-floating-point appears to lie somewhere between 
those of fixed-point and of floating-point. 

The third variable in the implementation of a digital filter, that of 
circuit configuration, is the principal factor determining the character 
(spectrum) of the output roundoff noise and, along with mode of the 
arithmetic, ultimately determines the number of data digits required to 
satisfy the performance specifications. In fact, the key step in the syn- 
thesis of a digital filter is the selection of an appropriate configuration 
for the digital circuit. There are a multitude of equivalent circuit con- 
figurations for any given linear discrete filter (whose transfer function 
is expressible as a rational fraction in z); but in the implementation of the 
corresponding digital filter, these configurations are no longer equivalent, 
in general, because of the effects of coefficient quantization and roundoff 
noise. As noted previously, the effects of coefficient quantization are 
deterministic and can thus be accounted for exactly as a (typically 
small) change in the transfer function of the discrete filter. Therefore, 
assuming that the coefficients for the configurations under consideration 
have been (or can be) quantized satisfactorily, the choice between these 
configurations is then determined by the level and character of their 
output roundoff noise. As we will show, there can be very significant 
differences between the roundoff-noise outputs of otherwise equivalent 
digital filter configurations. 

The content and complexity of any analysis of roundoff noise are 
determined to a large extent by the assumed correlation between roundoff
errors. If these errors may be assumed to be uncorrelated from sample 
to sample and from multiplier (or other rounding point) to multiplier, 
then the roundoff-noise analysis is relatively straightforward, and the 
results are independent of the exact nature of the input signal to the 
filter. If, on the other hand, uncorrelated errors may not be assumed, 
then the analysis is much more complex, and the results are generally 
dependent on the particular input signal or class of input signals. This 
paper is concerned exclusively with the uncorrelated-error case because 
this assumption seems to be valid for most filters with input signals of 
reasonable amplitude and spectral content. Even in this case, the inclusion
of the associated dynamic-range constraints makes the analysis 
reasonably involved and the corresponding synthesis problem quite 
complex. 

Although the generic term "roundoff noise" has been used to include 
the case of truncation as well as rounding, we actually concentrate on 
the rounding case. As long as the assumption of uncorrelated errors can 
be made, our results are applicable to either case, with the error variance 
for truncation being four times that for rounding. However, as the 
input signals become less "random", the uncorrelated-error assumption 
tends to break down for truncation more readily than for rounding. 
Hence, additional care must be exercised in applying these results to the 
truncation case. 

III. FILTER MODEL FOR UNCORRELATED-ROUNDOFF-NOISE ANALYSIS 

The analyses appearing in the literature concerning roundoff noise 
in digital filters usually employ the simplifying and often reasonable 
assumption of uncorrelated roundoff errors from sample to sample and 
from one error source (multiplier or other rounding point) to 
another. 9,12,13 This assumption is based on the intuitively plausible 
and experimentally supported notion that for sufficiently large and 
dynamic signals within the filter, the small roundoff error made at one 
point in the network and/or in time should have little relationship to 
(that is, correlation with) the roundoff error made at any other point 
in the network and/or time. The advantage of assuming uncorrelated 
errors from one sample to another is that the noise injected into the 
filter by each rounding operation is then "white"; while the advantage 
of assuming uncorrelated error sources is that the output noise power 
spectrum may then be computed as simply the superposition of the 
(filtered) noise spectra due to the separate error sources. 12 Experimental 
results which support the validity of this assumption, even in the case 
of a single sinusoidal input, are presented in Ref. 1. In this section, we 
introduce the notation and develop the analysis pertaining to uncorrelated
roundoff noise for later use in investigating the synthesis of 
digital filters. 

Digital filter networks are composed of three basic elements: adders, 
constant multipliers, and delays. The interconnection of these elements 
into a particular network configuration is the key step in digital filter 
synthesis. For our purposes here, we need only consider the network 
as a directed graph, with the multipliers and delays being represented 
by graph branches. The branch interconnection points, or nodes, will 
be divided into two types: "summation nodes", which correspond to 
the adders and have multiple inputs and a single output, and "branch 
nodes", which correspond to simple "wired" interconnections that have 
a single input and one or more outputs. 

A digital filter network may thus be represented as shown in Fig. 
1. The input to and output from the filter at time t = nT are denoted 
by u(n) and y(n), respectively. The corresponding output from the 
i th branch node is denoted by $v_i(n)$, while the roundoff error introduced 
into the filter at the j th summation node is denoted by $e_j(n)$. Since 
with fixed-point arithmetic, rounding is performed only after multiplications,
non-zero roundoff errors are "input" to the filter only at those 
summation nodes which follow constant (non-integer) multiplier 
branches, as depicted in Fig. 2. 

Fig. 1: General digital filter model. 

Fig. 2: Constant multiplier with preceding branch node and succeeding summation node. 

For a unit sample input to the filter at t = 0 and no rounding [that is, 
$u(0) = 1$, $u(n) = 0$ for $n \neq 0$, and $e_j(n) = 0$ for all j and n], the resulting 
output values $y(n)$ and $v_i(n)$ for all $n \geq 0$ and all i are designated as 
$h(n)$ and $f_i(n)$, respectively. Alternatively, for a unit sample input to 
the j th summation node and zero inputs otherwise [that is, $e_j(0) = 1$, 
$e_j(n) = 0$ for $n \neq 0$, and $e_k(n) = u(n) = 0$ for all n and for $k \neq j$], the 
resulting output values $y(n)$ for all $n \geq 0$ are denoted by $g_j(n)$. We 
thus have the following transfer functions of interest, expressed in 
z-transform form: 

From filter input to output: 

$$H^*(z) = \sum_{n=0}^{\infty} h(n)\, z^{-n}. \tag{1}$$

From filter input to i th branch-node output: 

$$F_i^*(z) = \sum_{n=0}^{\infty} f_i(n)\, z^{-n}. \tag{2}$$

From j th summation-node input to filter output: 

$$G_j^*(z) = \sum_{n=0}^{\infty} g_j(n)\, z^{-n}. \tag{3}$$

These transfer functions are indicated in Fig. 1. 

The frequency responses (Fourier transforms) corresponding to the 
above transfer functions are given by 

$$H(\omega) = H^*(e^{j\omega T}), \tag{4}$$

$$F_i(\omega) = F_i^*(e^{j\omega T}), \tag{5}$$

$$G_j(\omega) = G_j^*(e^{j\omega T}). \tag{6}$$

This notation will be used throughout this paper. That is, for any 
z-transform $A^*(z)$ which converges for $|z| = 1$, the corresponding 
Fourier transform is given by 

$$A(\omega) = A^*(e^{j\omega T}).$$

If scaling has been included in the filter design in order to satisfy certain 
dynamic-range constraints, then prime marks (') are added to denote 
this fact [for example, $F_i'(\omega)$, $F_i'^{*}(z)$]. 

Each error source (rounding operation) within the filter is assumed 
to inject white noise of uniform power-spectral density $N_0$. Assuming 
uniformly distributed rounding errors with zero mean, the variance of 
the roundoff noise from each error source is given by 12,13 

$$\sigma_0^2 = \Delta^2 / 12 \tag{7}$$

where $\Delta$ is the spacing of the quantization steps (after rounding). To 
eliminate the sampling period T from certain expressions of interest, 
we now define $N_0 = \sigma_0^2$. Hence, the variance, or total average power, 
corresponding to an arbitrary power-density spectrum $N(\omega)$ with no 
DC component (which implies a zero-mean process) is given by† 

$$\sigma^2 = \frac{1}{\omega_s} \int_0^{\omega_s} N(\omega)\, d\omega \tag{8}$$

where $\omega_s$ is the radian sampling frequency given by 

$$\omega_s = 2\pi / T. \tag{9}$$
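
The variance in equation (7) can be checked empirically. The following sketch assumes a step size of 2^-8 and inputs drawn uniformly from [-1, 1); the helper `round_to_step` and all numeric values are illustrative assumptions, not part of the paper.

```python
import math
import random

def round_to_step(x, step):
    """Round x to the nearest multiple of step (round half up)."""
    return math.floor(x / step + 0.5) * step

random.seed(0)                 # fixed seed so the run is repeatable
step = 2.0 ** -8               # assumed quantization step (Delta)
num_samples = 100_000

errors = []
for _ in range(num_samples):
    x = random.uniform(-1.0, 1.0)
    errors.append(round_to_step(x, step) - x)

mean = sum(errors) / num_samples
variance = sum((e - mean) ** 2 for e in errors) / num_samples
predicted = step ** 2 / 12.0   # equation (7)

print("measured variance :", variance)
print("Delta**2 / 12     :", predicted)
```

For inputs that are large compared with the step size, the measured variance should lie close to the predicted value.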

Assume now that $k_j$ error sources input to the j th summation node. 
The spectral density of the roundoff error sequence $\{e_j(n)\}$ is then just 
$k_j N_0$ by our assumption of uncorrelated error sources. The total roundoff 
noise in the output of the filter thus has a power-density spectrum given 
by 12 

$$N_y(\omega) = \sigma_0^2 \sum_j k_j\, |G_j(\omega)|^2 \tag{10a}$$

where we have substituted $\sigma_0^2$ for $N_0$. If scaling has been included in the 
filter design, then the corresponding expression is just 

$$N_y(\omega) = \sigma_0^2 \sum_j k_j'\, |G_j'(\omega)|^2 \tag{10b}$$

where $k_j' \geq k_j$ to account for the additional scaling multipliers. 
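
As a rough check of (10a), consider an assumed first-order filter y(n) = a y(n-1) + u(n) with a single rounded product, so that one error source feeds the only summation node and g(n) = a^n. Integrating (10a) as in (8) then predicts a total output noise variance of sigma_0^2 / (1 - a^2). The sketch below compares this with a simulation; the coefficient, step size, input range, and helper names are all illustrative assumptions.

```python
import math
import random

def round_to_step(x, step):
    """Round x to the nearest multiple of step (round half up)."""
    return math.floor(x / step + 0.5) * step

random.seed(1)                 # fixed seed so the run is repeatable
a = 0.9                        # assumed filter coefficient
step = 2.0 ** -8               # assumed quantization step (Delta)
num_samples = 40_000

y_ideal = 0.0                  # filter without rounding
y_fixed = 0.0                  # filter with the product rounded
noise = []
for _ in range(num_samples):
    # The input already lies on the quantization grid, so the only
    # rounding error comes from the product a * y(n-1).
    u = round_to_step(random.uniform(-0.09, 0.09), step)
    y_ideal = a * y_ideal + u
    y_fixed = round_to_step(a * y_fixed, step) + u
    noise.append(y_fixed - y_ideal)

mean = sum(noise) / num_samples
measured = sum((d - mean) ** 2 for d in noise) / num_samples

sigma0_sq = step ** 2 / 12.0                 # equation (7)
predicted = sigma0_sq / (1.0 - a * a)        # sigma0^2 * sum of a^(2n)
print("measured output noise variance :", measured)
print("predicted from (10a) and (8)   :", predicted)
```

The sample mean is removed before the variance is computed, since the rounding rule used here can leave a small DC offset. Agreement is only approximate because the uncorrelated-error assumption holds only approximately for any particular input.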

IV. DYNAMIC-RANGE CONSTRAINTS 

The ultimate objective of the synthesis procedures to be investigated 
will be the minimization of some norm of $N_y(\omega)$ for a given quantization 
step size $\Delta$, subject to certain "constraints". One constraint is that the 
specified transfer function $H^*(z)$ must be maintained. Another fundamental,
but often overlooked, constraint is the finite dynamic range of 
the filter. Specifically, the signals $v_i(n)$ at certain branch nodes within 
the filter cannot be allowed to "overflow" (that is, exceed the dynamic-range
limitations), at least not more than some small percentage of the 
time, in order to prevent severe distortion in the filter output. 

† This normalization of $N(\omega)$ is further motivated by the derivation in Section 
V leading to equation (30b). 

Overflow constraints are required only at certain branch nodes in the 
digital circuit because it is only the inputs to the constant multipliers 
which cannot be allowed to overflow when several standard numbering 
systems are used (for example, one's- or two's-complement binary). 14 
Specifically, in the summation of more than two numbers, if the magni- 
tude of the correct total sum is small enough to allow its representation 
by the K available digits, then in these numbering systems the correct 
total sum will be obtained regardless of the order in which the numbers 
are added, even if an overflow occurs in one of the partial sums. Hence, 
those node outputs which correspond to partial sums comprising a 
larger total sum may be allowed to overflow, as long as the total sum is 
constrained not to overflow. This property also applies when one of the 
inputs to a summation node has overflowed as a result of a multiplica- 
tion by a coefficient of magnitude greater than one. 
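
The partial-sum property described above is easy to demonstrate. The sketch below models a K-bit two's-complement adder with a hypothetical helper `wrap`; the word length and the addends are assumed values chosen so that the true sum fits in the word while one partial sum does not.

```python
def wrap(value, bits):
    """Wrap an integer into the two's-complement range of a `bits`-bit word."""
    modulus = 1 << bits
    half = 1 << (bits - 1)
    return (value + half) % modulus - half

def wrapped_sum(terms, bits):
    """Accumulate terms, wrapping after every addition as a K-bit adder would.

    Returns the final total and a flag telling whether any addition wrapped.
    """
    total = 0
    overflowed = False
    for t in terms:
        exact = total + t
        total = wrap(exact, bits)
        overflowed = overflowed or (total != exact)
    return total, overflowed

bits = 8                                   # assumed word length K: range -128..127
orders = ([100, 90, -120], [-120, 100, 90], [90, -120, 100])
for order in orders:
    # The true sum is 70 in every ordering; only the first ordering
    # has a partial sum (190) that exceeds the 8-bit range.
    print(order, wrapped_sum(order, bits))
```

Every ordering yields the same total, including the one whose intermediate sum wraps around.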

Turning to the formulation of the required overflow constraints, we 
may easily derive an upper bound on the magnitude of the signals 
$v_i(n)$ for all possible input sequences $\{u(n)\}$, neglecting the (small) 
error signals $e_j(n)$. Assuming zero initial conditions in the filter and 
$e_j(n) = 0$ for all j and n, the i th branch-node output $v_i(n)$ is given by 

$$v_i(n) = \sum_{k=0}^{\infty} f_i(k)\, u(n-k), \quad \text{all } n. \tag{11}$$

Therefore, given that u(n) is bounded in magnitude by some number 
M for all n, an upper bound on the magnitude of $v_i(n)$ is given by 16 

$$|v_i(n)| \leq M \sum_{k=0}^{\infty} |f_i(k)|, \quad \text{all } n. \tag{12}$$

Thus, if the node signal $v_i(n)$ is also to be bounded in magnitude by 
M for all possible input sequences, the associated scaling must ensure 
that 

$$\sum_{k=0}^{\infty} |f_i'(k)| \leq 1. \tag{13}$$

That (13) is not only a sufficient condition to rule out overflow for all 
possible input sequences $\{u(n)\}$, but also a necessary condition, is easily 
shown by letting $u(n) = \pm M$ for all n, with $\operatorname{sgn}[u(n_0 - k)] = \operatorname{sgn}[f_i(k)]$ 
for some $n = n_0$ and all $k \geq 0$. Then from equation (11) we see that 
(12) is satisfied with equality in this case, and thus (13) is a necessary 
condition, as well. 
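
The bound (12), the scaling condition (13), and the worst-case input used in the necessity argument can be illustrated with a short sketch. It assumes a first-order node response f(k) = a^k truncated to a finite length; the coefficient, the bound M, and the helper names are hypothetical.

```python
def impulse_response(a, length):
    """First `length` samples of f(k) = a**k, the response of v(n) = a*v(n-1) + u(n)."""
    return [a ** k for k in range(length)]

def convolve_at(f, u, n):
    """Evaluate v(n) = sum_k f(k) u(n - k), as in (11), for lists f and u."""
    return sum(f[k] * u[n - k] for k in range(len(f)) if 0 <= n - k < len(u))

a = -0.8                      # assumed pole; gives a sign-alternating response
M = 1.0                       # assumed bound on |u(n)|
length = 200                  # truncation length; a**200 is negligible here
f = impulse_response(a, length)

l1_norm = sum(abs(x) for x in f)      # the sum in (12), close to 1/(1 - |a|) = 5
scale = 1.0 / l1_norm                 # scaling that meets (13) with equality
f_scaled = [scale * x for x in f]

# Worst-case input: u(n0 - k) = M * sgn f(k), here with n0 = length - 1.
n0 = length - 1
u = [0.0] * length
for k in range(length):
    u[n0 - k] = M if f[k] >= 0 else -M

peak_unscaled = convolve_at(f, u, n0)
peak_scaled = convolve_at(f_scaled, u, n0)
print("sum |f(k)|        :", l1_norm)
print("peak without scale:", peak_unscaled)
print("peak with scale   :", peak_scaled)
```

The unscaled peak reaches M times the absolute sum, and the scaled peak reaches exactly M, which is the equality case discussed above.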

The norm of $f_i'(k)$ employed in (13) is not very useful in practice 
because of the difficulty of evaluating the indicated summation in all 
but the simplest cases. Also, for large classes of input signals, (12) and 
thus (13) are overly pessimistic. Therefore, we now derive alternate 
conditions on (the transform of) the scaled unit-sample response $\{f_i'(n)\}$ 
which ensure that for certain classes of input signals, the corresponding 
branch-node output $v_i(n)$ cannot overflow. The derivation of these 
conditions for discrete systems closely parallels the corresponding 
derivation for continuous systems, as given by Papoulis. 16 

An alternate expression for equation (11) in terms of z-transforms is 
derived as follows: Consider an (absolutely summable) deterministic 
input sequence $\{u(n)\}$ possessing the z-transform 

$$U^*(z) = \sum_{n=-\infty}^{\infty} u(n)\, z^{-n}, \qquad a < |z| < b, \tag{14}$$

for some a < 1 and b > 1. Stability requires that $F_i^*(z)$, defined in equation 
(2), exist for all $|z| > c$ for some c < 1. Hence, the z-transform of 
$\{v_i(n)\}$ is given by 3 

$$V_i^*(z) = F_i^*(z)\, U^*(z), \qquad d < |z| < b, \tag{15}$$

where d = max (a, c). The inverse transform of equation (15) is given 
by 3 

$$v_i(n) = \frac{1}{2\pi j} \oint_\Gamma V_i^*(z)\, z^{n-1}\, dz \tag{16}$$

where the contour of integration $\Gamma$ is contained in the region of convergence 
$d < |z| < b$. Since d < 1 and b > 1, let $\Gamma$ be the unit circle 
in the z plane ($|z| = 1$), and perform the change of variables $z = e^{j\omega T}$ 
in equation (16). Using equation (15), the resulting equation becomes 

$$v_i(n) = \frac{1}{\omega_s} \int_0^{\omega_s} F_i(\omega)\, U(\omega)\, e^{jn\omega T}\, d\omega. \tag{17}$$

The conditions to be derived from equation (17) are most easily 
expressed in terms of $L_p$ norms, defined for an arbitrary periodic function 
$A(\cdot)$ with period $\omega_s$ by 17 

$$\|A\|_p = \left[ \frac{1}{\omega_s} \int_0^{\omega_s} |A(\omega)|^p \, d\omega \right]^{1/p} \tag{18a}$$

for each real $p \geq 1$ such that 

$$\int_0^{\omega_s} |A(\omega)|^p \, d\omega < \infty.$$

It can be shown 17 that for $A(\cdot)$ continuous, the limit of equation (18a) 
as $p \to \infty$ exists and is given by 

$$\|A\|_\infty = \max_{0 \leq \omega \leq \omega_s} |A(\omega)|. \tag{18b}$$
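
The norms in (18a) and (18b) can be approximated on a uniform frequency grid. The sketch below is a minimal illustration, assuming a first-order response 1 / (1 - a z^-1) whose 2-norm and infinity-norm have simple closed forms; the function names, the grid size, and the coefficient are assumptions made for this example.

```python
import cmath
import math

def lp_norm(func, p, num_points=4096):
    """Approximate ||A||_p of (18a) on a uniform grid over one period.

    `func` takes the normalized frequency x = omega / omega_s in [0, 1).
    Passing p = math.inf selects the maximum of (18b).
    """
    values = [abs(func(i / num_points)) for i in range(num_points)]
    if p == math.inf:
        return max(values)
    return (sum(v ** p for v in values) / num_points) ** (1.0 / p)

def first_order_response(a):
    """Frequency response of 1 / (1 - a z^-1) evaluated at z = exp(j 2 pi x)."""
    return lambda x: 1.0 / (1.0 - a * cmath.exp(-2j * math.pi * x))

a = 0.5                                    # assumed coefficient
F = first_order_response(a)
for p in (1, 2, math.inf):
    print("p =", p, " ||F||_p =", lp_norm(F, p))

# Closed forms for this example, for comparison:
print("1/sqrt(1 - a^2) =", 1.0 / math.sqrt(1.0 - a * a))
print("1/(1 - a)       =", 1.0 / (1.0 - a))
```

Dividing the integral by the period makes the norm of a constant equal to that constant, which is why the norms of one response can be compared directly across values of p.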

Assume now that $|U(\omega)|$ is bounded from above by some number 
M (that is, $\|U\|_\infty \leq M$). Then, from equation (17), 

$$|v_i(n)| \leq M \cdot \frac{1}{\omega_s} \int_0^{\omega_s} |F_i(\omega)| \, d\omega$$

or 

$$|v_i(n)| \leq \|F_i\|_1 \cdot \|U\|_\infty. \tag{19}$$

In exactly the same manner, we may also show that 

$$|v_i(n)| \leq \|F_i\|_\infty \cdot \|U\|_1. \tag{20}$$

Applying the Schwarz inequality to equation (17), on the other hand, 
yields that 

$$|v_i(n)|^2 \leq \frac{1}{\omega_s^2} \int_0^{\omega_s} |F_i(\omega)|^2 \, d\omega \int_0^{\omega_s} |U(\nu)|^2 \, d\nu$$

or 

$$|v_i(n)| \leq \|F_i\|_2 \cdot \|U\|_2. \tag{21}$$

Note that (19), (20), and (21) are all of the form 

$$|v_i(n)| \leq \|F_i\|_p \cdot \|U\|_q, \qquad \left( \frac{1}{p} + \frac{1}{q} = 1 \right) \tag{22}$$

for p, q = 1, 2, and $\infty$. It can be shown 18 that (22) is true in general for 
all p, q > 1 satisfying 1/p + 1/q = 1; and we have shown in (19) and 
(20) that if the $L_\infty$ norms exist, then (22) holds for p, q = 1, as well. 
The general relation in (22) for all p, q > 1 is derived from Hölder's 
inequality. 

A simple but important special case of (22) results from letting $F_i^*(z) 
= F_i(\omega) = 1$. Since $\|1\|_p = 1$ for all $p \geq 1$, we then have simply 

$$|u(n)| \leq \|U\|_q, \quad \text{all } q \geq 1. \tag{23}$$

But since (23) holds for all sequences $\{u(n)\}$, it must also be true that 

$$|v_i(n)| \leq \|V_i\|_r, \quad \text{all } r \geq 1.$$

This is, in fact, the basis of (22), for Hölder's inequality actually states 
that 

$$\|V_i\|_1 \leq \|F_i\|_p \cdot \|U\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1.$$

Therefore, the real implication of (22) is that the mean absolute value 
of $V_i(\omega)$ is bounded by $\|F_i\|_p \|U\|_q$, and this, in turn, provides a 
bound on $|v_i(n)|$. 

Assume, therefore, that the input transform $U(\omega)$ satisfies $\|U\|_q \leq M$ 
for some $q \geq 1$. From (23) we immediately have that $|u(n)| \leq M$ for 
all n. Then, if $|v_i(n)|$ is also to be bounded by M, (22) provides a 
sufficient condition on the scaling to ensure this, namely 

$$\|F_i'\|_p \leq 1, \qquad (\|U\|_q \leq M) \tag{24}$$

for p = q/(q − 1). Inequality (24) is the desired condition to replace 
the more general, but often less useful condition given by (13). 
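
One way to read (24) is as a recipe for a scale factor: dividing the unscaled response by its p-norm gives a scaled response whose p-norm is exactly one. The sketch below computes such factors for p = 1, 2, and infinity, assuming a first-order response 1 / (1 - a z^-1); the helper `lp_norm`, the grid size, and the coefficient are illustrative assumptions rather than the paper's procedure.

```python
import cmath
import math

def lp_norm(func, p, num_points=4096):
    """Approximate the L_p norm of (18a) on a uniform grid over one period."""
    values = [abs(func(i / num_points)) for i in range(num_points)]
    if p == math.inf:
        return max(values)
    return (sum(v ** p for v in values) / num_points) ** (1.0 / p)

a = 0.5                                    # assumed coefficient

def F(x):
    """Unscaled response 1 / (1 - a z^-1) at z = exp(j 2 pi x)."""
    return 1.0 / (1.0 - a * cmath.exp(-2j * math.pi * x))

# Scale factor s_p = 1 / ||F||_p, so that the scaled response s_p * F
# satisfies (24) with equality.
scales = {p: 1.0 / lp_norm(F, p) for p in (1, 2, math.inf)}
for p, s in scales.items():
    scaled_norm = lp_norm(lambda x, s=s: s * F(x), p)
    print("p =", p, " scale =", s, " norm of scaled response =", scaled_norm)
```

The factor for p = infinity is the smallest of the three, so it attenuates the signal the most; the choice of p trades protection against overflow for signal level at the node.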

From an engineering viewpoint, the most significant values for p 
and q would seem to be 1, 2, and $\infty$. The case p = 1, q = $\infty$ requires 
that the input transform $U(\omega)$ be everywhere bounded in magnitude by 
M (that is, $\|U\|_\infty \leq M$), in which case only the $L_1$ norm of the scaled 
transfer function $F_i'(\omega)$ need satisfy (24). For an input of finite energy 
$E = \sum_n u^2(n)$, Parseval's identity implies that $\|U\|_2^2 = E$, and thus 
with $M \geq E^{1/2}$, (24) can be satisfied for p = q = 2. 

The case of p = $\infty$, q = 1 in (24) implies the most stringent condition 
on $F_i'(\omega)$ because from equation (18) it is evident that 

$$\|F_i'\|_p \leq \|F_i'\|_\infty \tag{25}$$

for all $p \geq 1$. It is clear, for example, that for a sinusoidal input of 
amplitude $A \leq M$ and arbitrary frequency $\omega_0$, we must have $|F_i'(\omega)| 
\leq 1$ for all $\omega$ (that is, $\|F_i'\|_\infty \leq 1$) to ensure that $|v_i(n)| \leq M$ for 
all n. However, a sinusoidal input sequence $\{u(n)\}$ is not absolutely 
summable, and thus $U^*(z)$ as defined in equation (14) does not exist 
in this case. This difficulty may be circumvented, as is common in 
Fourier analysis, by assuming a finite sequence of length N and then 
passing to the limit as $N \to \infty$. The resulting (Fourier) transform of 
$\{u(n)\}$ is of the form 

$$U(\omega) = \frac{\omega_s A}{2}\, e^{j\theta} \left[ \delta(\omega - \omega_0) + \delta(\omega - \omega_s + \omega_0) \right], \qquad (0 \leq \omega \leq \omega_s) \tag{26}$$

where $\delta(\omega)$ is the familiar Dirac delta function defined by 

$$\delta(\omega) = 0, \quad \omega \neq 0; \qquad \int_{-\infty}^{\infty} \delta(\omega)\, d\omega = 1. \tag{27}$$

$U(\omega)$ is, of course, periodic in $\omega$ with period $\omega_s$. From equations (18a), 
(26), and (27), we immediately have that $\|U\|_1 = A \leq M$, and thus 
with p = $\infty$, (24) is applicable for sinusoidal input sequences, as expected. 

V. RANDOM INPUT CASE 

In the case of random input sequences, (24) is not directly applicable 
because the z-transform $U^*(z)$ is not defined. Similar conditions may be 
obtained, however, by considering the discrete autocorrelation function 
$\varphi(\cdot)$, defined for a (wide-sense) stationary sequence $\{w(n)\}$ by 

$$\varphi_w(m) = E[w(n)\, w(n+m)] \tag{28}$$

where $E[\cdot]$ is the expected-value operator. A z-transform $\Phi_w^*(z)$ may be 
defined for the sequence $\{\varphi_w(m)\}$ as in equation (14) with an inverse 
transform as in (16). Assuming ergodicity and a zero mean ($E[w(n)] = 0$) 
for $\{w(n)\}$, we immediately have from equation (28) that the variance, 
or total average power, of $\{w(n)\}$ is given by 

$$\varphi_w(0) = E[w^2(n)] = \sigma_w^2, \tag{29}$$

and from equation (16) we also have 

$$\varphi_w(0) = \frac{1}{2\pi j} \oint_\Gamma \Phi_w^*(z)\, z^{-1}\, dz. \tag{30a}$$

Letting $\Gamma$ be the unit circle ($z = e^{j\omega T}$), equations (29) and (30a) imply 
that 

$$\sigma_w^2 = \frac{1}{\omega_s} \int_0^{\omega_s} \Phi_w(\omega)\, d\omega. \tag{30b}$$

Hence, from equation (8) we see that $\Phi_w(\omega)$ is just the power-density 
spectrum of the sequence $\{w(n)\}$. 

For an input sequence $\{u(n)\}$ whose autocorrelation function has 
the z-transform $\Phi_u^*(z)$, it is well-known that the corresponding transform 
for the output $\{v_i(n)\}$ is given by 

$$\Phi_{v_i}^*(z) = F_i^*(z)\, F_i^*(z^{-1})\, \Phi_u^*(z) \tag{31a}$$

or 

$$\Phi_{v_i}(\omega) = |F_i(\omega)|^2\, \Phi_u(\omega). \tag{31b}$$

Equations (29) through (31) imply then that 

$$\sigma_{v_i}^2 = \frac{1}{\omega_s} \int_0^{\omega_s} |F_i(\omega)|^2\, \Phi_u(\omega)\, d\omega. \tag{32}$$

Since equation (32) is of the same basic form as (17), a derivation similar 
to that leading to (22) must yield the following relations for $p, q \geq 1$: 

$$\sigma_{v_i}^2 \leq \|F_i^2\|_p \cdot \|\Phi_u\|_q, \qquad \left( \frac{1}{p} + \frac{1}{q} = 1 \right) \tag{33a}$$

or, from the definition of the norm in equation (18a), 

$$\sigma_{v_i}^2 \leq \|F_i\|_{2p}^2 \cdot \|\Phi_u\|_q, \qquad \left( \frac{1}{p} + \frac{1}{q} = 1 \right). \tag{33b}$$

Two cases of (33) are of particular interest, namely 

$$\sigma_{v_i}^2 \leq \|F_i\|_2^2 \cdot \|\Phi_u\|_\infty \tag{34}$$

and 

$$\sigma_{v_i}^2 \leq \|F_i\|_\infty^2 \cdot \|\Phi_u\|_1. \tag{35}$$

In view of equation (25), we see that (34) implies the most stringent 
condition on the input spectrum $\Phi_u(\omega)$, whereas (35) yields the most 
stringent condition on the transfer function $F_i(\omega)$. From (34) and (30b), 
for example, we have that if the input power-density spectrum is "white" 
[that is, $\Phi_u(\omega) = \sigma_u^2$ for all $\omega$], then $\sigma_{v_i}^2 \leq \|F_i\|_2^2\, \sigma_u^2$. Hence, if the input 
sequence $\{u(n)\}$ is a Gaussian process, 19 the node output sequence 
$\{v_i(n)\}$ will overflow no more (in percentage of time) than does the 
input, provided only that 

$$\|F_i'\|_2 \leq 1. \tag{36}$$

The inequality in (35) requires, on the other hand, that for an input 
sinusoid of arbitrary amplitude and frequency, $F_i'(\omega)$ must satisfy 

$$\|F_i'\|_\infty \leq 1 \tag{37}$$

to ensure against overflow, as we have seen earlier from (24). 

To summarize, dynamic-range constraints of the form 

$$\|F_i'\|_p \leq 1, \qquad p \geq 1 \tag{38}$$

have been derived for both deterministic and random inputs, where 
$F_i'(\omega)$ is the (scaled) transfer response from the filter input to the i th 
branch node and $\|\cdot\|_p$ denotes the $L_p$ norm defined in equation (18). 
For a deterministic input with amplitude spectrum $U(\omega)$, (38) assumes 
that 

$$\|U\|_q \leq M, \qquad q = \frac{p}{p-1}, \tag{39}$$

where M is the maximum allowable signal amplitude. For a random 
input, on the other hand, the use of (38) requires appropriate conditions 
on $\|\Phi_u\|_r$, r = p/(p − 2) and $p \geq 2$, where $\Phi_u(\omega)$ is the power-density 
spectrum of the input sequence. 

The effect of (38) and (39) is to bound the mean absolute value of the 
amplitude spectrum at the i th branch node (that is, $\|V_i\|_1$) which, in 
turn, bounds the peak signal amplitude at that node. The use of (38) 
in conjunction with (33), however, bounds only the average power at 
the i th branch node, and thus the relationship between this average 
power and the peak signal amplitude at the node must also be determined
in order to provide an effective dynamic-range constraint. 
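
For the white-input case discussed with (34) and (36), the node variance can be checked by simulation. The sketch below assumes a first-order node response v(n) = a v(n-1) + u(n) driven by white Gaussian samples, for which the squared 2-norm of F is 1 / (1 - a^2); the coefficient, the input standard deviation, and the sample count are illustrative assumptions.

```python
import random

random.seed(2)                # fixed seed so the run is repeatable
a = 0.5                       # assumed coefficient of v(n) = a*v(n-1) + u(n)
sigma_u = 0.1                 # assumed input standard deviation
num_samples = 50_000

v = 0.0
samples = []
for _ in range(num_samples):
    u = random.gauss(0.0, sigma_u)     # white Gaussian input sample
    v = a * v + u
    samples.append(v)

mean = sum(samples) / num_samples
measured = sum((x - mean) ** 2 for x in samples) / num_samples

f_l2_sq = 1.0 / (1.0 - a * a)          # ||F||_2 squared = sum of a**(2k)
predicted = f_l2_sq * sigma_u ** 2     # node variance for a white input
scale = 1.0 / f_l2_sq ** 0.5           # scaling that gives ||F'||_2 = 1, as in (36)

print("measured node variance :", measured)
print("predicted node variance:", predicted)
print("scale factor for (36)  :", scale)
```

With the printed scale factor applied ahead of the node, the node variance equals the input variance, which is the situation described above in which the node overflows no more often than the input.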

VI. TRANSPOSE SYSTEMS 

In the evaluation of different circuit configurations for a given digital 
filter, a useful concept relating certain of these configurations is that 
of "transpose configurations". This relationship is a general property 
of linear graphs 20 and will be presented here in terms of a state-variable 
formulation. 

The general state equations for a linear, time-invariant discrete system 
are given by 21 

$$\begin{aligned} x(n+1) &= A\,x(n) + B\,u(n), \\ y(n) &= C\,x(n) + D\,u(n) \end{aligned} \tag{40}$$

where x(n) is an N-dimensional vector describing the state of the system 
at time t = nT, u(n) is the corresponding J-dimensional input vector, 
y(n) is the corresponding I-dimensional output vector, and A, B, C, 
and D are fixed parameter matrices of the appropriate dimensions relating
the input, state, and output vectors as given by equation (40). 
The (N + I) × (N + J) matrix S defined by 

$$S = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \tag{41}$$

provides a convenient single parameter matrix which describes the 
complete discrete system. 

A transfer function matrix $\mathcal{H}_S^*(z)$ may be defined for the system (described
by) S relating the input and output vector sequences $\{u(n)\}$ 
and $\{y(n)\}$ by 

$$Y^*(z) = \mathcal{H}_S^*(z)\, U^*(z) \tag{42}$$

where $U^*(z)$ and $Y^*(z)$ are the vector z-transforms of $\{u(n)\}$ and $\{y(n)\}$, 
respectively. $\mathcal{H}_S^*(z)$ is readily shown to be given by 21 

$$\mathcal{H}_S^*(z) = C\,(zI - A)^{-1} B + D \tag{43}$$

where $(\cdot)^{-1}$ denotes the matrix inverse and I is the N-dimensional 
identity matrix. 

Consider now a new system which is described by the parameter 
matrix S', that is, 



S' = 



A 1 
B l 



C 

D l 



(44) 



where (•)' denotes the matrix transpose. From equations (41) and (43) 
it is easily seen that the transfer function matrix for the new system 
S' is given by 

3C* s ,(z) = B\zl - ATC + D* (45) 

= [3eS(*)] ( . 

Thus, the transfer function matrix for the system $S^t$ is simply the
transpose of the transfer function matrix for the system $S$. That is,
the element $H^*_{ij}(z)$ of $\mathcal{H}^*_S(z)$, which is the transfer
function from the $j$th input to the $i$th output of system $S$, equals
the $(j, i)$ element of $\mathcal{H}^*_{S^t}(z)$, that is, the transfer
function from the $i$th input to the $j$th output of $S^t$. Note also that
while the system $S$ has a total of $J$ inputs and $I$ outputs, the
system $S^t$ has $I$ inputs and $J$ outputs.
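The transpose property in equation (45) can be checked numerically. The sketch below builds $S$ as in (41) from randomly chosen matrices, transposes it as in (44), and compares the two transfer function matrices at one point $z$. The dimensions, the random seed, and the helper `transfer_matrix` are arbitrary choices for illustration, not part of the paper.

```python
import numpy as np

def transfer_matrix(A, B, C, D, z):
    """Evaluate C (zI - A)^{-1} B + D at one complex z, as in equation (43)."""
    return C @ np.linalg.solve(z * np.eye(A.shape[0]) - A, B) + D

rng = np.random.default_rng(0)
n_states, n_in, n_out = 3, 2, 4     # N, J, I (arbitrary illustrative sizes)
A = 0.3 * rng.standard_normal((n_states, n_states))
B = rng.standard_normal((n_states, n_in))
C = rng.standard_normal((n_out, n_states))
D = rng.standard_normal((n_out, n_in))

S = np.block([[A, B], [C, D]])      # equation (41), size (N + I) x (N + J)
St = S.T                            # equation (44)

# Partition S^t: its blocks are A^t, C^t (input matrix), B^t (output matrix), D^t.
At, Bt = St[:n_states, :n_states], St[:n_states, n_states:]
Ct, Dt = St[n_states:, :n_states], St[n_states:, n_states:]

z0 = 1.2 * np.exp(1j * 0.7)
H_S = transfer_matrix(A, B, C, D, z0)        # I x J
H_St = transfer_matrix(At, Bt, Ct, Dt, z0)   # J x I
transpose_ok = np.allclose(H_St, H_S.T)
```

Note that the transposed system swaps the roles of the input and output dimensions, which is why `H_St` has the shape of `H_S.T`.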

The concept of transpose systems will be particularly useful to us in
conjunction with the digital-filter model introduced in Section III and
depicted in Fig. 1. Defining the input and output vectors for the filter by

$$\mathbf{u}(n) = \begin{bmatrix} u(n) \\ e_1(n) \\ \vdots \\ e_J(n) \end{bmatrix}
\quad\text{and}\quad
\mathbf{y}(n) = \begin{bmatrix} y(n) \\ v_1(n) \\ \vdots \\ v_I(n) \end{bmatrix},
\tag{46}$$

respectively, the transfer function matrix for the filter is given by

$$\mathcal{H}^*(z) = \begin{bmatrix}
H^*(z) & G_1^*(z) & \cdots & G_J^*(z) \\
F_1^*(z) & & & \\
\vdots & & & \\
F_I^*(z) & & &
\end{bmatrix}, \tag{47}$$



where the specific expressions for the elements in other than the first row
and first column are unimportant for our purposes. By equation (45),
the transfer function matrix for the corresponding transpose system is
then simply

$$[\mathcal{H}^*(z)]^t = \begin{bmatrix}
H^*(z) & F_1^*(z) & \cdots & F_I^*(z) \\
G_1^*(z) & & & \\
\vdots & & & \\
G_J^*(z) & & &
\end{bmatrix}. \tag{48}$$



Note, in particular, that the transfer function from input-1 to output-1 
[that is, H*(z), the ideal transfer function from filter input to filter 
output] is the same for both systems. 

As discussed more fully in Ref. 1, the circuit configuration realizing a
given system $S$ is not necessarily unique, and hence neither is the
configuration for the transpose system $S^t$. However, given a particular
configuration for the system $S$, a unique "transpose configuration",
which realizes $S^t$, may be derived from the given configuration for $S$
by simply reversing the direction of all branches in the given network.†
In particular, then, all delays and constant multipliers remain the same
except for the change in direction. All summation nodes in the given
configuration become branch nodes in the transpose configuration, and
all branch nodes become summation nodes. Likewise, all inputs in the
given configuration become outputs in the transpose configuration, and
all outputs become inputs. 1

† Note that the transpose system $S^t$ is fundamentally different from the
"adjoint" system 22 because, although the signal flow is reversed in both,
the transpose system does not run "backwards in time."

That the transpose configuration defined above actually realizes the
transpose system $S^t$ is easily seen by considering the state equations in
(40). In what follows, a superscript $t$ on a matrix element marks the
corresponding parameter of the transpose configuration. The constant
multiplier(s) corresponding to the element $d_{ij}$ of the matrix $D$ and
relating the $j$th input and the $i$th output of the original
configuration must relate the $i$th input and the $j$th output of the
transpose configuration, and thus $d_{ij} = d^t_{ji}$ for all $i$ and $j$.
The multiplier(s) corresponding to the element $b_{ij}$ of $B$ and
relating the $j$th input and the $i$th state of the original
configuration must, on the other hand, relate the $i$th state and the
$j$th output of the transpose configuration, and thus $b_{ij} = c^t_{ji}$
for all $i$ and $j$. Similarly, $c_{ij} = b^t_{ji}$ for all $i$ and $j$.
Finally, the multiplier(s) corresponding to $a_{ij}$ and relating $x_j(n)$
and $x_i(n + 1)$ in the original configuration must, in the transpose
configuration, relate $x_i(n)$ and $x_j(n + 1)$, and thus
$a_{ij} = a^t_{ji}$ for all $i$ and $j$. Therefore, the transpose
configuration indeed realizes the system $S^t$.
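The same relationship can be observed in the time domain by iterating the state equations (40) directly. The following sketch, with an arbitrary randomly generated system and a hypothetical helper `impulse_response`, compares the impulse response from input $j$ to output $i$ of the original system with that from input $i$ to output $j$ of the transpose system.

```python
import numpy as np

def impulse_response(A, B, C, D, j, i, n_samples):
    """Output i of the state equations (40) for a unit impulse on input j."""
    x = np.zeros(A.shape[0])
    h = np.zeros(n_samples)
    for n in range(n_samples):
        u = np.zeros(B.shape[1])
        if n == 0:
            u[j] = 1.0
        h[n] = (C @ x + D @ u)[i]   # y(n) = C x(n) + D u(n)
        x = A @ x + B @ u           # x(n + 1) = A x(n) + B u(n)
    return h

rng = np.random.default_rng(1)
n_states, n_in, n_out = 3, 2, 2     # arbitrary illustrative sizes
A = 0.4 * rng.standard_normal((n_states, n_states))
B = rng.standard_normal((n_states, n_in))
C = rng.standard_normal((n_out, n_states))
D = rng.standard_normal((n_out, n_in))

# Parameter matrices of the transpose system: A^t, C^t, B^t, D^t.
At, Bt, Ct, Dt = A.T, C.T, B.T, D.T

j, i = 1, 0
h_original = impulse_response(A, B, C, D, j, i, 20)       # input j -> output i of S
h_transpose = impulse_response(At, Bt, Ct, Dt, i, j, 20)  # input i -> output j of S^t
responses_match = np.allclose(h_original, h_transpose)
```

Both responses run forward in time, which is the distinction drawn above between the transpose and adjoint systems.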

VII. AN EXAMPLE: THE DIRECT FORM

To demonstrate the application of the results of the preceding sections, 
we now evaluate and compare the roundoff-noise outputs from two 
transpose configurations for a digital filter. The scaling required to 
satisfy the overflow constraints in (38) is derived, and the effect of this 
scaling on the output roundoff noise is determined. 

The transfer function $H^*(z)$, defined in equation (1) and relating
the input and output of the digital filter, may be expressed as a rational
function in $z$ of the form 3,4

$$H^*(z) = \frac{\sum_{i=0}^{N} a_i z^{-i}}{1 + \sum_{i=1}^{N} b_i z^{-i}}
  = \frac{A^*(z)}{B^*(z)}. \tag{49}$$

Assuming that $a_N$ and $b_N$ are not both zero, $N$ is referred to as the "order"
of the filter. There are many different, but equivalent, forms in which 
equation (49) may be written, with a number of equivalent circuit 
configurations corresponding to each of these forms (at least two trans- 
pose configurations). Those forms such as equation (49) which require 
the minimum number of multiplications and additions in the general 
case (that is, 2N + 1 and 2N, respectively) are referred to as "canonical" 
forms. In general, however, it is necessary to add additional scaling 
multipliers to these canonical forms in order to satisfy the overflow 
constraints in (38). 

The form of H*(z) given in equation (49) is often called the "direct 
form" of a digital filter. It has been pointed out by Kaiser 6 that use of 
the direct form is usually to be avoided because of the sensitivity of the 
roots of higher-order polynomials to small variations (that is, quantiza- 
tion errors) in the polynomial coefficients. The roundoff-noise outputs 
from the direct form can also be much larger than from other canonical
forms. 15 Nevertheless, the direct form is of theoretical interest, and it
provides a convenient illustration of our results. Similar investigations 
for the two canonical forms most commonly employed in practice — the 
cascade and parallel forms — are described in Ref. 1. 

Two transpose configurations which implement the direct form with
scaling are shown in Figs. 3 and 4. These configurations actually realize
$H^*(z)$ in the form

$$H^*(z) = \frac{K'_k \sum_{i=0}^{N} {}_ka'_i\, z^{-i}}
                {1 + \sum_{i=1}^{N} b_i z^{-i}}, \tag{50}$$

where ${}_ka'_i = a_i/K'_k$, and the additional scaling multipliers $K'_k$,
$k = 1, 2$, are required to satisfy (38) in the general case. The
configuration in Fig. 3 will be designated as form 1 (that is, $k = 1$),
and Fig. 4 as form 2 (that is, $k = 2$).

The branch nodes at which overflow constraints are required (because
these signals input to multipliers) are indicated by (*). The dynamic-range
limitations are obviously satisfied (by assumption) at the input to
the filter, but for completeness, an overflow constraint is included there
as indicated. The scaled transfer responses ${}_kF'_i(\omega)$ to these
nodes are noted in Figs. 3 and 4, and the corresponding unscaled responses
${}_kF_i(\omega)$ apply, of course, when $K'_k = 1$.



[Fig. 3 — Direct form 1 with scaling.]

[Fig. 4 — Direct form 2 with scaling.]

It is intuitively clear that to preserve the greatest possible
signal-to-noise ratio, the scaling should reduce the magnitude of
${}_kF_i(\omega)$ no more than is necessary (or should increase it as much
as possible, as the case may be). In other words, ${}_kF'_i(\omega)$ should
satisfy

$$\|{}_kF'_i\|_p = 1. \tag{51}$$

This condition will be satisfied if the scaling factors ${}_ks_i$, defined by

$${}_kF'_i(\omega) = {}_ks_i \cdot {}_kF_i(\omega), \tag{52a}$$

are given by

$${}_ks_i = 1/\|{}_kF_i\|_p. \tag{52b}$$
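A minimal numerical sketch of the scaling rule in (52) follows. It assumes that the $L_p$ norm is the frequency-normalized norm (the mean of $|F|^p$ over one period, raised to the power $1/p$, with the maximum taken for $p = \infty$), which is how we read the earlier definition referenced as equation (18). The helper `lp_norm` and the first-order example response are hypothetical.

```python
import numpy as np

def lp_norm(F, p):
    """Normalized L_p norm of samples F taken uniformly over one period.

    Assumes ||F||_p = (mean |F|^p)^(1/p), with the maximum used for p = inf.
    """
    mag = np.abs(F)
    if np.isinf(p):
        return mag.max()
    return np.mean(mag ** p) ** (1.0 / p)

# Uniform grid of normalized frequency (omega * T) over one period.
omega = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
z_inv = np.exp(-1j * omega)

# Hypothetical unscaled response to a branch node: F(omega) = 1 / (1 - 0.9 z^-1).
F = 1.0 / (1.0 - 0.9 * z_inv)

p = 2
s = 1.0 / lp_norm(F, p)    # scaling factor, equation (52b)
F_scaled = s * F           # scaled response, equation (52a)
```

By construction, `lp_norm(F_scaled, p)` equals 1 up to rounding, so the constraint (51) holds with equality for the chosen $p$.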

It is readily seen from Figs. 3 and 4 that

$${}_1F'_1(\omega) = {}_2F'_1(\omega) = 1, \tag{53}$$

and hence equation (51) is automatically satisfied for these responses.
Of more interest, however, are the responses

$${}_1F'_2(\omega) = \frac{K'_1}{B(\omega)} \tag{54}$$

and

$${}_2F'_2(\omega) = \frac{A(\omega)}{K'_2\,B(\omega)}
  = \frac{H(\omega)}{K'_2}. \tag{55}$$

From equations (52), (54), and (55), it follows that (51) is satisfied
for these configurations if (and only if)

$$K'_1 = 1/\|1/B\|_p \tag{56}$$

and

$$K'_2 = \|H\|_p. \tag{57}$$
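The scaling multipliers in (56) and (57) can be computed for a concrete filter as sketched below. The second-order coefficients are an arbitrary stable example chosen here, the helpers `lp_norm` and `response` are hypothetical, and the normalized $L_p$ norm is assumed as before.

```python
import numpy as np

def lp_norm(F, p):
    """Normalized L_p norm over a uniform frequency grid (assumed definition)."""
    mag = np.abs(F)
    return mag.max() if np.isinf(p) else np.mean(mag ** p) ** (1.0 / p)

def response(coeffs, omega):
    """Evaluate sum_i coeffs[i] * exp(-j * i * omega) on the grid omega."""
    k = np.arange(len(coeffs))
    return np.exp(-1j * np.outer(omega, k)) @ np.asarray(coeffs, dtype=float)

omega = np.linspace(0.0, 2.0 * np.pi, 2048, endpoint=False)
a = [0.2, 0.4, 0.2]     # numerator coefficients a_0 .. a_N (illustrative)
b = [1.0, -0.8, 0.5]    # denominator coefficients 1, b_1 .. b_N (illustrative)

A_w = response(a, omega)
B_w = response(b, omega)
H_w = A_w / B_w

p = np.inf
K1 = 1.0 / lp_norm(1.0 / B_w, p)   # equation (56), form 1
K2 = lp_norm(H_w, p)               # equation (57), form 2

F1_scaled = K1 / B_w               # equation (54)
F2_scaled = H_w / K2               # equation (55)
```

With these choices both `lp_norm(F1_scaled, p)` and `lp_norm(F2_scaled, p)` come out as 1, which is the condition (51).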

The rounding-error inputs $e_j(n)$ are also shown in Figs. 3 and 4
along with the transfer responses ${}_kG'_j(\omega)$ from these inputs to
the output of the filter. Note that in form 2 (Fig. 4) the error input
$e_2(n)$ incorporates the roundoff errors from all of the multipliers
except $K'_2$ even though these error sources are separated by delays
($z^{-1}$). This is done for convenience and is possible because of the
assumption of uncorrelated errors from sample to sample and source to
source. The noise weights $k'_j$ [see equation (10)] for form 1 are thus

$${}_1k'_1 = {}_1k'_2 = N + 1; \tag{58a}$$

while for form 2,

$${}_2k'_1 = 1 \quad\text{and}\quad {}_2k'_2 = 2N + 1. \tag{58b}$$

The indices $i$ and $j$ of the ${}_kF_i(\omega)$ and ${}_kG_j(\omega)$ have
been assigned in such a way that forms 1 and 2 are related as in equations
(47) and (48). That is, these unscaled responses satisfy the following
equations:

$${}_1F_i(\omega) = {}_2G_i(\omega), \quad i = 1, 2, \tag{59a}$$

$${}_1G_i(\omega) = {}_2F_i(\omega), \quad i = 1, 2. \tag{59b}$$

Note that the scaled responses ${}_kF'_i(\omega)$ and ${}_kG'_j(\omega)$ are
not related as in equation (59) because, in general, $K'_1 \neq K'_2$. In
particular,

$${}_1G'_2(\omega) = \frac{H(\omega)}{K'_1}
  = \left(\frac{1}{K'_1}\right) {}_1G_2(\omega); \tag{60}$$

while

$${}_2G'_2(\omega) = \frac{K'_2}{B(\omega)}
  = (K'_2)\, {}_2G_2(\omega). \tag{61}$$

However, we do have, as in equation (53), that

$${}_1G'_1(\omega) = {}_2G'_1(\omega) = 1. \tag{62}$$

From equations (10) and (53) through (62), the power-spectral densities
of the roundoff-noise outputs from these two configurations are
thus computed to be

$${}_1N_y(\omega) = \sigma_0^2 (N + 1)
  \left\{1 + \left\|\frac{1}{B}\right\|_p^2 |H(\omega)|^2\right\} \tag{63a}$$

and

$${}_2N_y(\omega) = \sigma_0^2
  \left\{1 + (2N + 1)\, \|H\|_p^2 \left|\frac{1}{B(\omega)}\right|^2\right\}.
  \tag{63b}$$



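The variances, or total average powers of the output roundoff noise
from these configurations are then, from equations (8) and (18), simply

$$\|{}_1N_y\|_1 = \sigma_0^2 (N + 1)
  \left\{1 + \left\|\frac{1}{B}\right\|_p^2 \|H\|_2^2\right\} \tag{64a}$$

and

$$\|{}_2N_y\|_1 = \sigma_0^2
  \left\{1 + (2N + 1)\, \|H\|_p^2 \left\|\frac{1}{B}\right\|_2^2\right\}.
  \tag{64b}$$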



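The peak noise densities $\|{}_kN_y\|_\infty$ are, on the other hand,
bounded by

$$\|{}_1N_y\|_\infty \le \sigma_0^2 (N + 1)
  \left\{1 + \left\|\frac{1}{B}\right\|_p^2 \|H\|_\infty^2\right\} \tag{65a}$$

and

$$\|{}_2N_y\|_\infty \le \sigma_0^2
  \left\{1 + (2N + 1)\, \|H\|_p^2 \left\|\frac{1}{B}\right\|_\infty^2\right\}.
  \tag{65b}$$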
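The expressions (63) through (65) can be evaluated numerically for a given filter, as in the following sketch. It assumes the normalized $L_p$ norm used in the earlier sketches, expresses all noise quantities in units of $\sigma_0^2$, and uses an arbitrary second-order example; none of the names or coefficient values come from the paper.

```python
import numpy as np

def lp_norm(F, p):
    """Normalized L_p norm over a uniform frequency grid (assumed definition)."""
    mag = np.abs(F)
    return mag.max() if np.isinf(p) else np.mean(mag ** p) ** (1.0 / p)

def response(coeffs, omega):
    """Evaluate sum_i coeffs[i] * exp(-j * i * omega) on the grid omega."""
    k = np.arange(len(coeffs))
    return np.exp(-1j * np.outer(omega, k)) @ np.asarray(coeffs, dtype=float)

omega = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
a = [0.2, 0.4, 0.2]     # illustrative numerator coefficients
b = [1.0, -0.8, 0.5]    # illustrative denominator coefficients (leading 1)
N = len(b) - 1          # filter order

inv_B = 1.0 / response(b, omega)
H = response(a, omega) * inv_B

p = np.inf              # norm used in the overflow constraint (51)
c1 = lp_norm(inv_B, p) ** 2
c2 = lp_norm(H, p) ** 2

# Noise spectra, equations (63a) and (63b), in units of sigma_0^2.
N1 = (N + 1) * (1.0 + c1 * np.abs(H) ** 2)
N2 = 1.0 + (2 * N + 1) * c2 * np.abs(inv_B) ** 2

# Variances, equations (64a) and (64b).
var1 = (N + 1) * (1.0 + c1 * lp_norm(H, 2) ** 2)
var2 = 1.0 + (2 * N + 1) * c2 * lp_norm(inv_B, 2) ** 2

# Bounds on the peak densities, equations (65a) and (65b).
peak1 = (N + 1) * (1.0 + c1 * lp_norm(H, np.inf) ** 2)
peak2 = 1.0 + (2 * N + 1) * c2 * lp_norm(inv_B, np.inf) ** 2
```

Averaging `N1` or `N2` over the grid reproduces `var1` or `var2`, and the maxima of the spectra do not exceed `peak1` and `peak2`.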



We now compare direct forms 1 and 2 on the basis of (64) and (65).
Although comparisons based on bounds for $\|{}_kN_y\|_\infty$ as in (65) do
not, of course, necessarily hold for $\|{}_kN_y\|_\infty$ itself,
experimental results have indicated that such comparisons are quite
effective qualitatively, and often quantitatively as well. 1 Consider first
the expressions in equation (64) for $p = 2$ and in (65) for $p = \infty$
[that is, $\|N_y\|_r$, $r = 1, \infty$, for $p = r + 1$]. In these two
cases, the only difference between the (a) and (b) expressions for forms 1
and 2, respectively, is in the $k'_j$, as given in equation (58). In
particular, for $\|1/B\|_p^2\, \|H\|_p^2 \gg 1$ as is often the case, the
$\|N_y\|_r$ for form 1 are approximately half, or 3 dB less than, those for
form 2. This result simply reflects the fact that only half of the noise
sources in form 1 input at other than the filter output; whereas in form 2,
all but one input within the filter. Hence, if the gains from these inputs
to the output are large, form 1 is preferable to form 2 by up to 3 dB.

For $p \neq r + 1$, however, the differences in the $k'_j$ are of secondary
importance compared with the potential differences due to the mixture
of $L_2$ and $L_\infty$ norms in (64) and (65). In particular, letting

$$\theta_{\infty 2} = \left\|\frac{1}{B}\right\|_\infty \|H\|_2
\quad\text{and}\quad
\theta_{2\infty} = \left\|\frac{1}{B}\right\|_2 \|H\|_\infty, \tag{66a}$$

we immediately see that if $\theta_{\infty 2} \gg \theta_{2\infty}$, then
form 2 is better for $p = \infty$ while form 1 is better for $p = 2$. If,
on the other hand, $\theta_{\infty 2} \ll \theta_{2\infty}$, then
the opposite applies.
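Both comparisons above can be checked numerically. The sketch below uses a hypothetical narrow-band second-order filter (so that $\|1/B\|_2^2 \|H\|_2^2 \gg 1$), computes the variance ratio of form 1 to form 2 from (64) with $p = 2$, and then evaluates $\theta_{\infty 2}$ and $\theta_{2\infty}$ as read from (66a). The normalized $L_p$ norm and all names and coefficient values are assumptions for illustration.

```python
import numpy as np

def lp_norm(F, p):
    """Normalized L_p norm over a uniform frequency grid (assumed definition)."""
    mag = np.abs(F)
    return mag.max() if np.isinf(p) else np.mean(mag ** p) ** (1.0 / p)

def response(coeffs, omega):
    """Evaluate sum_i coeffs[i] * exp(-j * i * omega) on the grid omega."""
    k = np.arange(len(coeffs))
    return np.exp(-1j * np.outer(omega, k)) @ np.asarray(coeffs, dtype=float)

def noise_variances(H, inv_B, N, p):
    """Variances (64a) and (64b) in units of sigma_0^2 for constraint norm p."""
    v1 = (N + 1) * (1.0 + lp_norm(inv_B, p) ** 2 * lp_norm(H, 2) ** 2)
    v2 = 1.0 + (2 * N + 1) * lp_norm(H, p) ** 2 * lp_norm(inv_B, 2) ** 2
    return v1, v2

omega = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
a = [1.0, 2.0, 1.0]     # illustrative numerator
b = [1.0, -1.8, 0.9]    # illustrative denominator, poles near radius 0.95
N = len(b) - 1

inv_B = 1.0 / response(b, omega)
H = response(a, omega) * inv_B

# Case p = r + 1 with r = 1: same norms in both forms, only the weights differ.
v1, v2 = noise_variances(H, inv_B, N, 2)
ratio = v1 / v2
ratio_db = 10.0 * np.log10(ratio)   # negative: form 1 has the lower variance

# Case p != r + 1: mixed norms, summarized by the two products in (66a).
theta_inf2 = lp_norm(inv_B, np.inf) * lp_norm(H, 2)
theta_2inf = lp_norm(inv_B, 2) * lp_norm(H, np.inf)
v1_inf, v2_inf = noise_variances(H, inv_B, N, np.inf)
```

For large gains the ratio approaches $(N + 1)/(2N + 1)$, which tends to one half (3 dB) only as $N$ grows; for this second-order example the limit is 3/5. With $p = \infty$ the two variances reduce to $(N + 1)(1 + \theta_{\infty 2}^2)$ and $1 + (2N + 1)\theta_{2\infty}^2$.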

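To gain insight into the above conditions, we rewrite equation (66a) as

$$\theta_{\infty 2} = \left\|\frac{1}{B}\right\|_\infty \left\|\frac{A}{B}\right\|_2
\quad\text{and}\quad
\theta_{2\infty} = \left\|\frac{1}{B}\right\|_2 \left\|\frac{A}{B}\right\|_\infty.
\tag{66b}$$

It is then clear that the difference between $\theta_{\infty 2}$ and
$\theta_{2\infty}$ is due entirely to the effect of $A(\omega)$ on the $L_q$
norms of $A(\omega)/B(\omega)$ for $q = 2, \infty$ versus the corresponding
norms of $1/B(\omega)$. In particular, $A(\omega)$ affects the $L_2$ norm in
$\theta_{\infty 2}$ and the $L_\infty$ norm in $\theta_{2\infty}$. But the
$L_\infty$ norm of a function "concentrates" exclusively on the maximum
absolute value of that function; whereas the $L_2$ norm of a function
reflects the r.m.s. absolute value of that function over all argument
values. Therefore, the effect of $A(\omega)$ in $\theta_{2\infty}$ results
from the alteration of the maxima of $|1/B(\omega)|$ in
$|A(\omega)/B(\omega)|$; while in $\theta_{\infty 2}$, the effect concerns
the difference between $|1/B(\omega)|$ and $|A(\omega)/B(\omega)|$
over all $\omega$.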

Intuitively, one expects that the former effect is potentially much
greater; that is, in many cases $A(\omega)$ should affect the $L_\infty$
norm in $\theta_{2\infty}$ much more than the $L_2$ norm in
$\theta_{\infty 2}$. In particular, if $|A(\omega)|$ significantly
attenuates the maxima of $|1/B(\omega)|$ [as in a band-rejection filter,
for example], then $\theta_{2\infty}$ should be much smaller than
$\theta_{\infty 2}$. In this case, form 2 should be used for $p = \infty$,
and form 1 for $p = 2$. If, however, $|A(\omega)|$ does not provide such
attenuation, then $|A(\omega)|$ must be relatively constant within the
band(s) where $|1/B(\omega)|$ is largest [by the nature of $A(\omega)$],
and hence

$$\left\|\frac{A}{B}\right\|_q \approx |A(\omega_0)| \cdot
  \left\|\frac{1}{B}\right\|_q, \quad q = 2, \infty, \tag{67}$$

where $\omega_0$ is a frequency at or near a maximum of $|1/B(\omega)|$.
But then,

$$\theta_{\infty 2} \approx \theta_{2\infty} \approx |A(\omega_0)| \cdot
  \left\|\frac{1}{B}\right\|_2 \left\|\frac{1}{B}\right\|_\infty, \tag{68}$$

and the difference between direct forms 1 and 2 should be less in this
case.
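The band-rejection case can be illustrated with a simple second-order notch, in which the zeros of $A$ sit on the unit circle at the same angles as the poles and so attenuate the maxima of $|1/B(\omega)|$. The filter below (zeros at $\omega T = \pm\pi/2$, poles at radius 0.95) is a hypothetical example chosen here, and the normalized $L_p$ norm is assumed as in the earlier sketches.

```python
import numpy as np

def lp_norm(F, p):
    """Normalized L_p norm over a uniform frequency grid (assumed definition)."""
    mag = np.abs(F)
    return mag.max() if np.isinf(p) else np.mean(mag ** p) ** (1.0 / p)

def response(coeffs, omega):
    """Evaluate sum_i coeffs[i] * exp(-j * i * omega) on the grid omega."""
    k = np.arange(len(coeffs))
    return np.exp(-1j * np.outer(omega, k)) @ np.asarray(coeffs, dtype=float)

omega = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
r = 0.95
a = [1.0, 0.0, 1.0]       # A: zeros on the unit circle at omega*T = +/- pi/2
b = [1.0, 0.0, r * r]     # B: poles at radius r and the same angles

inv_B = 1.0 / response(b, omega)
H = response(a, omega) * inv_B

# The two products of norms defined in (66a) and rewritten in (66b).
theta_inf2 = lp_norm(inv_B, np.inf) * lp_norm(H, 2)
theta_2inf = lp_norm(inv_B, 2) * lp_norm(H, np.inf)

# Per the discussion above: when theta_2inf is much smaller than theta_inf2,
# form 2 is indicated for p = infinity and form 1 for p = 2.
use_form_2_for_p_inf = theta_2inf < theta_inf2
```

In this example the notch removes the peak of $|1/B(\omega)|$ from $|H(\omega)|$, so `theta_2inf` comes out well below `theta_inf2`.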
VIII. SUMMARY

The interaction between the roundoff-noise output from a digital 
filter and the associated dynamic-range limitations has been investigated 
for the case of uncorrelated rounding errors from sample to sample and 
from one error source to another. The spectrum of the output roundoff 
noise from fixed-point implementations was readily shown to be of the 
form 

$$N_y(\omega) = \sigma_0^2 \sum_j k'_j\, |G'_j(\omega)|^2, \tag{69}$$

where the $G'_j(\omega)$ are scaled transfer responses from certain
"summation nodes" in the digital circuit to the filter output,
$\sigma_0^2$ is the variance of the rounding errors from each multiplier
(or other rounding point), and the $k'_j$ are integers indicating the
number of error inputs to the respective summation nodes.

Defining $F'_i(\omega)$ to be the scaled transfer response from the input to
the $i$th "branch node" at which a dynamic-range constraint is required,
constraints of the form

$$\|F'_i\|_p \le 1 \tag{70}$$

for $p \ge 1$ were then derived, where $\|F'_i\|_p$ is the $L_p$ norm of the
response $F'_i(\omega)$. The appropriate value of $p$ is determined by
assumed conditions
on the spectra of the input signals to the filter. The effect of (70) is to 
bound the maximum signal amplitude (for deterministic inputs) or the 
maximum average power (for random inputs) at the i th branch node. 
A state-variable description was employed to formulate the general 
concept of "transpose configurations" for a digital network and to 
illustrate the usefulness of this concept in digital-filter synthesis. A 
particularly important result is that for a given unscaled configuration
with transfer responses $F_i(\omega)$ and $G_j(\omega)$, as described above,
the responses $F^t_i(\omega)$ and $G^t_j(\omega)$ for the corresponding
transpose configuration are given by

$$F^t_i(\omega) = G_i(\omega) \quad\text{and}\quad
  G^t_j(\omega) = F_j(\omega). \tag{71}$$

Hence, although the overall transfer functions for these two configura- 
tions are the same, their roundoff-noise outputs can be quite different, 
in general. The transpose configuration is obtained by simply reversing 
the direction of all branches in the given network configuration, and 
the poles and zeros of the network are thus realized in reverse order in 
the transpose configuration. 

To illustrate these results, the roundoff-noise spectra $N_y(\omega)$ for
two transpose configurations for the direct form of a digital filter were
calculated and compared. The direct form should usually be avoided in
practice, 8 but it is still of theoretical interest and provides a
convenient example of our general approach. Using a very natural assignment
of the indices $i$ and $j$ for the unscaled $F_i(\omega)$ and
$G_j(\omega)$, equation (69) was shown to be of the form

$$N_y(\omega) = \sigma_0^2 \left\{ k'_{M+1}
  + \sum_{j=1}^{M} k'_j\, \|F_j\|_p^2\, |G_j(\omega)|^2 \right\} \tag{72}$$

for these (scaled) configurations for the direct form, where $M$ is the
number of error inputs at other than the output of the filter. Hence, the
variance, or total average power, of the output roundoff noise is simply

$$\|N_y\|_1 = \sigma_0^2 \left\{ k'_{M+1}
  + \sum_{j=1}^{M} k'_j\, \|F_j\|_p^2\, \|G_j\|_2^2 \right\}; \tag{73}$$

while the peak spectral density $\|N_y\|_\infty$ is bounded by

$$\|N_y\|_\infty \le \sigma_0^2 \left\{ k'_{M+1}
  + \sum_{j=1}^{M} k'_j\, \|F_j\|_p^2\, \|G_j\|_\infty^2 \right\}. \tag{74}$$

Identical expressions to (72) through (74) can also be derived for the
parallel and cascade forms of a digital filter. 1 The relationship between
the noise outputs of corresponding transpose configurations is immediately
indicated by (71) through (74) [although, in general,
$k'_j \neq k'^{\,t}_j$].